Example of using ptm_pred to prototype phosphorylation classifiers
Tyrosine phosphorylation is a quick place to start: there is not much data, which means the code runs much faster.
Predictor is the class that handles reading the data. sequence_vector is a function that vectorizes a protein sequence into a feature array, representing amino acids as integer values from 0 to 20, where 0 represents empty space used to even out vector length. It can also include hydrophobicity as a feature.
In [1]:
from pred import Predictor
from pred import sequence_vector
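To make the integer encoding described above concrete, here is a minimal sketch. It is not the actual sequence_vector implementation: the function name encode_sequence, the residue ordering, and the fixed length argument are all assumptions made for illustration.

```python
# Hypothetical illustration; not the actual pred.sequence_vector code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is an assumption)
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 1..20, 0 is reserved for padding


def encode_sequence(seq, length):
    """Map residues to 1..20 (unknown symbols to 0), then trim or zero-pad to `length`."""
    codes = [AA_TO_INT.get(aa, 0) for aa in seq.upper()[:length]]
    return codes + [0] * (length - len(codes))


print(encode_sequence("ACDY", 6))  # [1, 2, 3, 20, 0, 0]
```

Padding with 0 gives every sequence the same vector length, which is what most classifiers expect as input.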
Next we are going to load our data and generate random negative data, aka gibberish data. The clean data files have negatives created from the data sets pulled from Phospho.ELM and dbPTM.
In generate_random_data, the amino acid parameter is the target amino acid, i.e. the residue being modified, and the float passed in is a multiplier. For example, a multiplier of 0.5 means that 0.5 * number of data points random negatives are generated.
In [2]:
y = Predictor()
y.load_data(file="Data/Training/clean_Y.csv")
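The multiplier arithmetic described above can be sketched as follows. This is a hypothetical stand-in, not the generate_random_data method itself: the function name random_negatives, the window size, and the choice to place the target residue in the center of a random peptide are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_negatives(n_data_points, amino_acid, multiplier, window=7, seed=0):
    """Return int(multiplier * n_data_points) random peptides centered on `amino_acid`.

    Hypothetical sketch: `window` residues are drawn uniformly on each side
    of the target residue.
    """
    rng = random.Random(seed)
    count = int(multiplier * n_data_points)
    negatives = []
    for _ in range(count):
        left = "".join(rng.choice(AMINO_ACIDS) for _ in range(window))
        right = "".join(rng.choice(AMINO_ACIDS) for _ in range(window))
        negatives.append(left + amino_acid + right)
    return negatives


negatives = random_negatives(n_data_points=100, amino_acid="Y", multiplier=0.5)
print(len(negatives))  # 50
```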
Next we vectorize the sequences using the sequence vector. We also apply a data balancing function; here we are using ADASYN, which generates synthetic examples of the minority (in this case positive) class.
In [3]:
y.process_data(vector_function="sequence", amino_acid="Y", imbalance_function="ADASYN", random_data=0)
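For intuition about what the ADASYN step does, here is a simplified pure-Python sketch of the idea: minority points surrounded by more majority neighbours receive more synthetic samples, each created by interpolating towards a nearby minority point. The function adasyn_sketch and its parameters are hypothetical, and this is not the code process_data runs internally, which the notebook does not show.

```python
import math
import random


def adasyn_sketch(minority, majority, k=3, beta=1.0, seed=0):
    """Return synthetic minority points following a simplified ADASYN scheme."""
    rng = random.Random(seed)
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    target = (len(majority) - len(minority)) * beta  # total synthetic points wanted

    # r_i: fraction of majority points among the k nearest neighbours of each minority point
    ratios = []
    for p in minority:
        neighbours = sorted((q for q in labelled if q[0] is not p),
                            key=lambda q: math.dist(p, q[0]))[:k]
        ratios.append(sum(1 for _, label in neighbours if label == 0) / k)

    total = sum(ratios)
    if total == 0:
        return []  # no minority point has majority neighbours

    synthetic = []
    for p, r in zip(minority, ratios):
        n_new = round(r / total * target)  # harder points get more samples
        peers = sorted((q for q in minority if q is not p),
                       key=lambda q: math.dist(p, q))[:k]
        if not peers:
            continue
        for _ in range(n_new):
            q = rng.choice(peers)
            lam = rng.random()  # interpolation weight in [0, 1)
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(p, q)))
    return synthetic


minority = [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8)]
majority = [(0.0, 0.0), (0.5, 0.2), (2.0, 2.0), (2.5, 1.5),
            (1.0, 0.5), (1.8, 1.0), (0.2, 1.5)]
print(adasyn_sketch(minority, majority))
```

Because each synthetic point is an interpolation between two real minority points, it always lies on the line segment between them.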
Now we can train a classifier; here we pass "mlp_adam" to supervised_training.
The output array contains the precision, recall, F-score, and total number correctly estimated.
In [4]:
y.supervised_training("mlp_adam")
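As a reminder of what the reported metrics mean, the sketch below computes them by hand for a binary problem. The helper precision_recall_fscore is hypothetical and is not part of pred; treating the fourth value as the count of correct predictions is an assumption based on the description above.

```python
def precision_recall_fscore(y_true, y_pred, positive=1):
    """Return (precision, recall, F-score, number of correct predictions)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        fscore = 2 * precision * recall / (precision + recall)
    else:
        fscore = 0.0
    correct = sum(1 for t, p in pairs if t == p)
    return precision, recall, fscore, correct


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(precision_recall_fscore(y_true, y_pred))  # (0.75, 0.75, 0.75, 6)
```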
Next we can check against the benchmarks pulled from dbPTM.
In [ ]:
y.benchmark("Data/Benchmarks/phos.csv", "Y")
To explore the data some more, you can easily generate PCA and t-SNE diagrams of the training set.
In [ ]:
y.generate_pca()
In [ ]:
y.generate_tsne()
There you have it: you have prototyped a tyrosine classifier.